depth function
A Trainable Centrality Framework for Modern Data
Vu, Minh Duc, Liu, Mingshuo, Zhou, Doudou
Measuring how central or typical a data point is underpins robust estimation, ranking, and outlier detection, but classical depth notions become expensive and unstable in high dimensions and are hard to extend beyond Euclidean data. We introduce Fused Unified centrality Score Estimation (FUSE), a neural centrality framework that operates on top of arbitrary representations. FUSE combines a global head, trained from pairwise distance-based comparisons to learn an anchor-free centrality score, with a local head, trained by denoising score matching to approximate a smoothed log-density potential. A single parameter between 0 and 1 interpolates between these calibrated signals, yielding depth-like centrality from different views via one forward pass. Across synthetic distributions, real images, time series, and text data, and standard outlier detection benchmarks, FUSE recovers meaningful classical ordering, reveals multi-scale geometric structures, and attains competitive performance with strong classical baselines while remaining simple and efficient.
seMCD: Sequentially implemented Monte Carlo depth computation with statistical guarantees
Gnettner, Felix, Kirch, Claudia, Nieto-Reyes, Alicia
Statistical depth functions provide center-outward orderings in spaces of dimension larger than one, where a natural ordering does not exist. The numerical evaluation of such depth functions can be computationally prohibitive, even for relatively low dimensions. We present a novel sequentially implemented Monte Carlo methodology for the computation of, theoretical and empirical, depth functions and related quantities (seMCD), that outputs an interval, a so-called seMCD-bucket, to which the quantity of interest belongs with a high probability prespecified by the user. For specific classes of depth functions, we adapt algorithms from sequential testing, providing finite-sample guarantees. For depth functions dependent on unknown distributions, we offer asymptotic guarantees using non-parametric statistical methods. In contrast to plain-vanilla Monte Carlo methodology the number of samples required in the algorithm is random but typically much smaller than standard choices suggested in the literature. The seMCD method can be applied to various depth functions, covering multivariate and functional spaces. We demonstrate the efficiency and reliability of our approach through empirical studies, highlighting its applicability in outlier or anomaly detection, classification, and depth region computation. In conclusion, the seMCD-algorithm can achieve accurate depth approximations with few Monte Carlo samples while maintaining rigorous statistical guarantees.
Theoretical Investigation on Inductive Bias of Isolation Forest
Zheng, Qin-Cheng, Zhang, Shao-Qun, Lyu, Shen-Huan, Jiang, Yuan, Zhou, Zhi-Hua
Isolation Forest (iForest) stands out as a widely-used unsupervised anomaly detector valued for its exceptional runtime efficiency and performance on large-scale tasks. Despite its widespread adoption, a theoretical foundation explaining iForest's success remains unclear. This paper theoretically investigates the conditions and extent of iForest's effectiveness by analyzing its inductive bias through the formulation of depth functions and growth processes. Since directly analyzing the depth function proves intractable due to iForest's random splitting mechanism, we model the growth process of iForest as a random walk, enabling us to derive the expected depth function using transition probabilities. Our case studies reveal key inductive biases: iForest exhibits lower sensitivity to central anomalies while demonstrating greater parameter adaptability compared to $k$-Nearest Neighbor anomaly detectors. Our study provides theoretical understanding of the effectiveness of iForest and establishes a foundation for further theoretical exploration.
Partial Rankings of Optimizers
Rodemann, Julian, Blocher, Hannah
We introduce a framework for benchmarking optimizers according to multiple criteria over various test functions. Based on a recently introduced union-free generic depth function for partial orders/rankings, it fully exploits the ordinal information and allows for incomparability. Our method describes the distribution of all partial orders/rankings, avoiding the notorious shortcomings of aggregation. This permits to identify test functions that produce central or outlying rankings of optimizers and to assess the quality of benchmarking suites. Despite its importance for machine learning research, there is no broad agreement on how to compare optimization algorithms on benchmark suites with regard to multiple criteria, see Hansen et al. (2022) for instance. This is particularly relevant for multi-objective optimization, which has diverse applications ranging from reinforcement learning (Basaklar et al., 2023; Zhu et al., 2023) to representation learning (Gu et al., 2023), neural architecture search (Lu et al., 2019) and large language models (Zhou et al., 2023). But such comparisons also arise when single-objective optimizers are evaluated with respect to several metrics, see Sivaprasad et al. (2020); Mattson et al. (2020); Dahl et al. (2023). A popular example is the duality of fixed-budget (performance) and fixed-target (speed) evaluation of deep learning optimizers, see e.g. In this work, we propose a novel framework for comparing optimizers with respect to multiple criteria over a benchmarking suite of test functions.
Fast kernel half-space depth for data with non-convex supports
Castellanos, Arturo, Mozharovskyi, Pavlo, d'Alché-Buc, Florence, Janati, Hicham
Data depth is a statistical function that generalizes order and quantiles to the multivariate setting and beyond, with applications spanning over descriptive and visual statistics, anomaly detection, testing, etc. The celebrated halfspace depth exploits data geometry via an optimization program to deliver properties of invariances, robustness, and non-parametricity. Nevertheless, it implicitly assumes convex data supports and requires exponential computational cost. To tackle distribution's multimodality, we extend the halfspace depth in a Reproducing Kernel Hilbert Space (RKHS). We show that the obtained depth is intuitive and establish its consistency with provable concentration bounds that allow for homogeneity testing. The proposed depth can be computed using manifold gradient making faster than halfspace depth by several orders of magnitude. The performance of our depth is demonstrated through numerical simulations as well as applications such as anomaly detection on real data and homogeneity testing.
Comparing Machine Learning Algorithms by Union-Free Generic Depth
Blocher, Hannah, Schollmeyer, Georg, Nalenz, Malte, Jansen, Christoph
We propose a framework for descriptively analyzing sets of partial orders based on the concept of depth functions. Despite intensive studies in linear and metric spaces, there is very little discussion on depth functions for non-standard data types such as partial orders. We introduce an adaptation of the well-known simplicial depth to the set of all partial orders, the union-free generic (ufg) depth. Moreover, we utilize our ufg depth for a comparison of machine learning algorithms based on multidimensional performance measures. Concretely, we provide two examples of classifier comparisons on samples of standard benchmark data sets. Our results demonstrate promisingly the wide variety of different analysis approaches based on ufg methods. Furthermore, the examples outline that our approach differs substantially from existing benchmarking approaches, and thus adds a new perspective to the vivid debate on classifier comparison.
Extreme Scenario Selection in Day-Ahead Power Grid Operational Planning
Terrén-Serrano, Guillermo, Ludkovski, Michael
We propose and analyze the application of statistical functional depth metrics for the selection of extreme scenarios in day-ahead grid planning. Our primary motivation is screening of probabilistic scenarios for realized load and renewable generation, in order to identify scenarios most relevant for operational risk mitigation. To handle the high-dimensionality of the scenarios across asset classes and intra-day periods, we employ functional measures of depth to sub-select outlying scenarios that are most likely to be the riskiest for the grid operation. We investigate a range of functional depth measures, as well as a range of operational risks, including load shedding, operational costs, reserves shortfall and variable renewable energy curtailment. The effectiveness of the proposed screening approach is demonstrated through a case study on the realistic Texas-7k grid.
Depth Functions for Partial Orders with a Descriptive Analysis of Machine Learning Algorithms
Blocher, Hannah, Schollmeyer, Georg, Jansen, Christoph, Nalenz, Malte
We propose a framework for descriptively analyzing sets of partial orders based on the concept of depth functions. Despite intensive studies of depth functions in linear and metric spaces, there is very little discussion on depth functions for non-standard data types such as partial orders. We introduce an adaptation of the well-known simplicial depth to the set of all partial orders, the union-free generic (ufg) depth. Moreover, we utilize our ufg depth for a comparison of machine learning algorithms based on multidimensional performance measures. Concretely, we analyze the distribution of different classifier performances over a sample of standard benchmark data sets. Our results promisingly demonstrate that our approach differs substantially from existing benchmarking approaches and, therefore, adds a new perspective to the vivid debate on the comparison of classifiers.
Concentration of the exponential mechanism and differentially private multivariate medians
Ramsay, Kelly, Jagannath, Aukosh, Chenouri, Shoja'eddin
We prove concentration inequalities for the output of the exponential mechanism about the maximizer of the population objective function. This bound applies to objective functions that satisfy a mild regularity condition. To illustrate our result, we study the problem of differentially private multivariate median estimation. We present novel finite-sample performance guarantees for differentially private multivariate depth-based medians which are essentially sharp. Our results cover commonly used depth functions, such as the halfspace (or Tukey) depth, spatial depth, and the integrated dual depth. We show that under Cauchy marginals, the cost of heavy-tailed location estimation outweighs the cost of privacy. We demonstrate our results numerically using a Gaussian contamination model in dimensions up to $d = 100$, and compare them to a state-of-the-art private mean estimation algorithm.
Statistical Depth Functions for Ranking Distributions: Definitions, Statistical Learning and Applications
Goibert, Morgane, Clémençon, Stéphan, Irurozki, Ekhine, Mozharovskyi, Pavlo
The concept of median/consensus has been widely investigated in order to provide a statistical summary of ranking data, i.e. realizations of a random permutation $\Sigma$ of a finite set, $\{1,\; \ldots,\; n\}$ with $n\geq 1$ say. As it sheds light onto only one aspect of $\Sigma$'s distribution $P$, it may neglect other informative features. It is the purpose of this paper to define analogs of quantiles, ranks and statistical procedures based on such quantities for the analysis of ranking data by means of a metric-based notion of depth function on the symmetric group. Overcoming the absence of vector space structure on $\mathfrak{S}_n$, the latter defines a center-outward ordering of the permutations in the support of $P$ and extends the classic metric-based formulation of consensus ranking (medians corresponding then to the deepest permutations). The axiomatic properties that ranking depths should ideally possess are listed, while computational and generalization issues are studied at length. Beyond the theoretical analysis carried out, the relevance of the novel concepts and methods introduced for a wide variety of statistical tasks are also supported by numerous numerical experiments.